Where do you think autumn-spawning Atlantic herring (Clupea harengus) spawns? In order to approximate the spawning distribution of Atlantic herring, species distribution models were created for larvae of herring.
We collect data of Atlantic herring larvae occurrences from 2000 - 2020 from the International Herring Larvae Surveys (IHLS). The occurrence dataset is read from a parquet file stored in the data lake.
# the file to process
acf <- S3FileSystem$create(
anonymous = T,
scheme = "https",
endpoint_override = "s3.waw3-1.cloudferro.com"
)
eurobis <- arrow::open_dataset(acf$path("emodnet/biology/eurobis_occurence_data/eurobisgeoparquet/eurobis_no_partition_sorted.parquet" ))
# Atlantic herring has the aphiaID 126417
# the specific dataset international Herring Larvae Surveys has the id 4423
df_occs <- eurobis |>
filter(aphiaidaccepted==126417, datasetid==4423,
longitude > -12, longitude < 10,
latitude > 48, latitude < 62,
observationdate >= as.POSIXct("2000-01-01"),
observationdate <= as.POSIXct("2020-12-31")) |>
collect()
glimpse(df_occs)
## Rows: 7,902
## Columns: 10
## $ occurrenceid <int> 18246122, 18144150, 18003290, 18245864, 182458…
## $ datasetid <int> 4423, 4423, 4423, 4423, 4423, 4423, 4423, 4423…
## $ observationdate <dttm> 2016-01-18, 2013-01-14, 2009-01-18, 2015-09-2…
## $ longitude <dbl> 4.500000, 4.500000, 4.500000, -3.833333, -3.83…
## $ latitude <dbl> 52.41667, 52.55000, 52.58300, 58.58333, 58.750…
## $ scientificname <chr> "Clupea harengus", "Clupea harengus", "Clupea …
## $ aphiaid <int> 126417, 126417, 126417, 126417, 126417, 126417…
## $ scientificname_accepted <chr> "Clupea harengus", "Clupea harengus", "Clupea …
## $ aphiaidaccepted <int> 126417, 126417, 126417, 126417, 126417, 126417…
## $ taxonrank <int> 220, 220, 220, 220, 220, 220, 220, 220, 220, 2…
# select relevant variables from occurrence dataset
df_occs <- df_occs %>%
dplyr::select(Latitude=latitude,
Longitude=longitude,
Time=observationdate) %>%
mutate(year = year(Time),
month = month(Time))
# plot
mapview(df_occs %>% dplyr::select(Longitude) %>% pull,
df_occs %>% dplyr::select(Latitude) %>% pull,
crs = "epsg:4326")
We start with 7902 occurrences from the IHLS. These occurrences are present in the following months:
table(df_occs$month)
##
## 1 9 10 12
## 2293 4480 99 1030
df_occs <- df_occs %>%
distinct(year, month, Longitude, Latitude, .keep_all = TRUE)
After removal of duplicates, 5347 occurrences are left.
To account for sampling bias in the occurrence dataset, a spatial filtering technique is applied (Vollering et al., 2019). Here we thinned the occurrences so that each pair of occurrences has a minimum distance of 10 nautical miles or 18.52 km. This distance is the recommended distance between valid hauls in the DATRAS trawl surveys (ICES, 2015).
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(123)
df_occs_thinned <- df_occs[0,] %>% dplyr::select(-Time)
# loop through all timesteps and thin the dataset at each step using parallel processing
df_occs_thinned <-
foreach(y = 1:length(2000:2020), .combine = rbind) %:%
foreach(m = 1:12, .combine = rbind, .packages = "dplyr") %dopar% {
tmp_df <- df_occs |> filter(year == c(2000:2020)[y], month == m)
tmp_df_thinned <- spThin::thin(mutate(tmp_df, species = "Atlantic herring"),
lat.col = "Latitude", long.col = "Longitude",
spec.col = "species", thin.par = 18.52, reps = 1,
write.files = FALSE, locs.thinned.list.return = TRUE,
verbose = FALSE)[[1]]
tmp_df_thinned |> mutate(
year = c(2000:2020)[y],
month = m)
}
save(df_occs_thinned, file = "data/df_thinned.Rdata")
The thinning of the occurrences results in a remaining number of 3998 observations.
load("data/df_thinned.Rdata")
mapview(df_occs_thinned %>% dplyr::select(Longitude) %>% pull,
df_occs_thinned %>% dplyr::select(Latitude) %>% pull,
crs = "epsg:4326")
After dealing with spatial bias, 3998 occurrences remain for modelling.
We will use Maximum entropy (Maxent) models to fit the habitat suitability of herring larvae (S. J. Phillips et al., 2004). This approach makes use of background points instead of (pseudo-)absences to characterize the study area (S. J. Phillips et al., 2009). The following chunk derives 10000 background points from within a buffer of 100 km around the occurrences (Barve et al., 2011).
set.seed(123)
abs <- spatiotemp_pseudoabs(spatial.method = "buffer", temporal.method = "random",
occ.data = df_occs_thinned %>% mutate(x = Longitude, y = Latitude),
temporal.ext = c("2000-01-01", "2020-12-31"), spatial.buffer = 100000,
n.pseudoabs = 10000)
glimpse(abs)
For this example, we do not extrapolate the model to months where we have no information on the larvae. For this reason, the monthly values of the background points are limited to the months where occurrences of larvae are present.
abs <- abs %>%
mutate(month = sample(unique(df_occs_thinned$month), nrow(abs), replace = TRUE)) %>%
mutate(Longitude = x, Latitude = y) %>%
dplyr::select(-day, -x, -y)
glimpse(abs)
save(abs, file = "zarr_extraction/absences_save.Rdata")
At each occurrence and background point, we want to know what the value is of the environmental variables at this location and time.
Traditionally, this is proceeded by a cumbersome process of looking for and downloading relevant maps for each parameter. Consequently, environmental values could be extracted at point locations from these maps. Since lots of environmental data are now stored in the datalake of EDITO as .zarr files. These .zarr files can be accessed efficiently, directly at the location and time needed.
The spatial resolution of environmental variables should match the resolution of the species records during species distribution modelling (Sillero & Barbosa, 2021). To solve this, environmental maps are traditionally resampled to the same resolution as the species records. The extraction method from .zarr files on EDITO allows the user to provide a buffer. When a buffer is provided, environmental values are not only extracted at the requested longitude, latitude and time coordinate but from all cells within the specified buffer around the location. Next, a user specified function (i.e. mean, max, min, median) is applied to these values to summarize this information into one value for each requested longitude, latitude and time coordinate. This way, an uncertainty is added to the location of the occurrence / (pseudo-)absence / background point. In this example, we apply a buffer of 10 NM, which is the resolution of the occurrence data.
load("zarr_extraction/SAVE/absences_save.Rdata")
#combine occurrences and background points
df_occ_bg <- rbind(df_occs_thinned %>% mutate(presence = 1),
abs %>% mutate(presence = 0))
glimpse(df_occ_bg)
source("zarr_extraction/editoTools.R")
options("outputdebug"=c('L','M'))
load(file = "zarr_extraction/editostacv2.par")
#the requested timestep resolution of the dataset in milliseconds
#in this case we work with monthly data (1 month = 30.436875*24*3600*1000 = 2629746000 milliseconds)
timeSteps=c(2629746000)
##TODO: ADD ELEVATION, WINDFARMS AND SEABED SUBSTRATE ENERGY ------
parameters = list("thetao" = c("par" = "thetao", "fun" = "mean", "buffer" = "18520"),
"so" = c("par" = "so", "fun" = "mean", "buffer" = "18520"),
"zooc" = c("par" = "zooc", "fun" = "mean", "buffer" = "18520"),
"phyc" = c("par" = "phyc", "fun" = "mean", "buffer" = "18520"))
#check if they are all available in the data lake
for (parameter in parameters) {
param = ifelse(length(parameter) == 1, parameter, parameter["par"])
if(! param %in% unique(EDITOSTAC$par)) dbl("Unknown parameter ", param)
}
#extract function (requires POSIXct Time column)
df_occ_bg_env = enhanceDF(inputPoints = df_occ_bg %>%
mutate(Time = as.POSIXct(paste(year,month,1,sep = "-"))),
requestedParameters = parameters,
requestedTimeSteps = timeSteps,
stacCatalogue = EDITOSTAC,
verbose="on")
glimpse(df_occ_bg_env)
df_occ_bg_env <- df_occ_bg_env %>% dplyr::select(Longitude, Latitude, year, month,
thetao, so, zooc, phyc)
# remove observations where no environmental values were present
df_occ_bg_env <- drop_na(df_occ_bg_env)
table(df_occ_bg_env$presence)
# 0 1
# 7617 3910
## TODO remove this part ----
substr_lvl <- tibble(sub_char = c("Fine mud", "Sand", "Muddy sand", "Mixed sediment",
"Coarse substrate","Sandy mud or Muddy sand", "Seabed",
"Rock or other hard substrata","Sandy mud", "Sandy mud or Muddy sand ",
"Sediment","Fine mud or Sandy mud or Muddy sand"),
seabed_substrate = c(1:12))
energy_lvl <- tibble(ene_char = c("High energy", "Moderate energy", "Low energy", "No energy information"),
seabed_energy = c(1:4))
df_occ_bg_env <- df_occ_bg_env %>%
left_join(energy_lvl, by = "seabed_energy") %>%
left_join(substr_lvl, by = "seabed_substrate") %>%
dplyr::select(-seabed_energy, -seabed_substrate) %>%
mutate(windfarms = as.factor(windfarms))
save(df_occ_bg_env, file = "zarr_extraction/SAVE/pres_abs_env_save.Rdata")
Maxent can be tailored by combining different feature classes and regularization multipliers (S. B. Phillips et al., 2006). Fifteen combinations were tested using the corrected Akaike’s Information Criterion (AICc) as a selection criterion (Zeng et al., 2016).
load(file = "zarr_extraction/SAVE/pres_abs_env_save.Rdata")
glimpse(df_occ_bg_env)
# input for ENMevaluate requires lon & lat as first covariates
#test different model settings using ENMeval
model_fit <- ENMeval::ENMevaluate(occs = df_occ_bg_env %>%
filter(presence == 1) %>%
dplyr::select(Longitude, Latitude, depth, Phyto, SSS, SST, windfarms, ZooPl, ene_char, sub_char),
bg = df_occ_bg_env %>%
filter(presence == 0) %>%
dplyr::select(Longitude, Latitude, depth, Phyto, SSS, SST, windfarms, ZooPl, ene_char, sub_char),
tune.args = list(fc = c("L","LQ","LQH"),
rm = c(1,2,4,8,32)),
algorithm = "maxnet",
partitions = "randomkfold",
categoricals = c("sub_char", "ene_char", "windfarms"),
doClamp = TRUE,
parallel = TRUE)
save(model_fit, file = "SAVE/model_fit_save.Rdata")
The default output of the ENMeval package shows the Area Under the Curve of the Receiver Operating Characteristic plot (AUC) metric as an evaluation metric. This metric was derived using a 5-fold cross validation.
load("zarr_extraction/SAVE/model_fit_save.Rdata")
print(model_fit %>% eval.results() %>% filter(delta.AICc == 0))
## fc rm tune.args auc.train cbi.train auc.diff.avg auc.diff.sd auc.val.avg
## 1 LQH 1 fc.LQH_rm.1 0.8090342 NA 0.003626047 0.003206 0.8084254
## auc.val.sd cbi.val.avg cbi.val.sd or.10p.avg or.10p.sd or.mtp.avg
## 1 0.003942938 NA NA 0.1025575 0.01737908 0.0002557545
## or.mtp.sd AICc delta.AICc w.AIC ncoef
## 1 0.0005718844 70485.62 0 1 32
The outcomes show that the lowest AICc score is achieved using a regularization multiplier of 1 and the feature combinations of linear, quadratic and hinge. Using 5-fold cross-validation, the model has an AUC score of 0.81, which implies a good fit according to Krzanowski & Hand (2009).
Visualize the partial response curves of the occurrence of larvae of herring per environmental variable, using a bootstrapping method derived from Thuiller et al. (2009).
load(file = "zarr_extraction/SAVE/pres_abs_env_save.Rdata")
#derive optimal model results
eval_res <- model_fit
opt.aicc <- eval.results(model_fit) %>% filter(delta.AICc == 0)
mod <- eval_res@models[[which(names(eval_res@models) == opt.aicc$tune.args)]]
#variables as defined in the model
vs <- c("depth", "SST", "SSS", "Phyto", "ZooPl", "windfarms", "sub_char", "ene_char")
#variables names for the plot
name_key <- data.frame(old = c("depth", "SST", "SSS", "Phyto", "ZooPl", "windfarms", "sub_char", "ene_char"),
new = c("Depth (m)%", "Sea surface temperature (°C)%", "Sea surface salinity (PSU)%",
"Phytoplankton concentration%(mmol C / m³)", "Zooplankton concentration%(g C / m²)",
"Windfarm presence", "Seabed substrate", "Seabed energy"))
#function to add a line break at %'s inputs
addline_format <- function(x,...){
gsub('%','\n',x)
}
#loops through all variables and creates a plot
for (i in 1:nrow(name_key)) {
v <- name_key$old[i]
out_name <- name_key$new[i]
dat <- response.plot(mod, v, type = "cloglog",
ylab = "Probability of occurrence",
plot = F)
if(is.character(dat[1,1])) {
assign(v, ggplot(dat) +
geom_bar(aes_string(x = v, y = "pred"), stat='identity', fill = "#332288") +
scale_y_continuous(limits = c(0,1)) +
coord_flip() +
labs(title = out_name, x = "", y = "Probability of presence") +
theme_bw() +
theme(legend.title = element_blank(),
plot.title = element_text(size=10, face = "bold", colour = "black", hjust = 0.5),
axis.text.y = element_text(size=10, face = "plain", colour = "black"),
axis.text.x = element_text(size=10, face = "plain", colour = "black"),
axis.title.x = element_text(size=10, face = "bold", colour = "black"),
axis.title.y = element_text(size=10, face = "bold", colour = "black"),
legend.text = element_text(size=10, face = "bold", colour = "black")))
} else {
#define plot bounds (restricted to where occurrences are present)
min <- df_occ_bg_env %>% filter(presence == 1) %>% dplyr::select(all_of(v)) %>% min
max <- df_occ_bg_env %>% filter(presence == 1) %>% dplyr::select(all_of(v)) %>% max
assign(v, ggplot(dat) +
geom_line(aes_string(x = v, y = "pred"), linewidth = 1, color = "#332288") +
labs(x = addline_format(out_name), y = "Probability of presence") +
scale_x_continuous(limits = c(min, max)) +
scale_y_continuous(limits = c(0,1)) +
theme_bw() +
theme(legend.title = element_text(size=10, face = "bold", colour = "black"),
legend.text = element_text(size=10, face = "plain", colour = "black"),
axis.text.y = element_text(size=10, face = "plain", colour = "black"),
axis.text.x = element_text(size=10, face = "plain", colour = "black"),
axis.title.x = element_text(size=10, face = "bold", colour = "black"),
axis.title.y = element_text(size=10, face = "bold", colour = "black")))
}
}
grid.arrange(grobs = list(depth, SST + xlab(""), SSS + xlab(""),
Phyto, ZooPl + xlab(""), windfarms + xlab(""),
ene_char, sub_char),
layout_matrix = rbind(c(1,1,2,2,3,3),
c(4,4,5,5,6,6),
c(7,7,7,8,8,8)))

Finally, we want to make spatial predictions for the habitat suitability in the Northeast Atlantic. This is done by combining the created model with maps of the environmental variables. We will make a projection of the model for January 2020 here.
First, maps are derived from the .zarr files in the datalake at a given time and between a set of longitude and latitudes (defined by the study area).
variables <- c("depth", "SST", "SSS", "Phyto",
"ZooPl", "windfarms", "seabed_substrate",
"seabed_energy")
files <- tibble(name = list.files("data/raster_slices/", pattern = ".tif|.grd", full.names = TRUE),
parameter = str_extract(name, pattern = paste0(variables, collapse = "|")),
month = str_extract(name, pattern = "\\d{2}|\\d"))
#get model
eval_res <- model_fit
opt.aicc <- eval.results(model_fit) %>% filter(delta.AICc == 0)
mod <- eval_res@models[[which(names(eval_res@models) == opt.aicc$tune.args)]]
# variables <- c("thetao", "so", "phyc")
#
# raster_list <- list()
#
# for (i in 1:length(variables)) {
# raster_list[[i]] <- getRasterSlice2(variables[i],
# stacCatalogue = EDITOSTAC,
# lon_min = -12,
# lon_max = 10,
# lat_min = 48,
# lat_max = 62,
# requestedTimeSteps = NA,
# date = "2020-01-01")
# }
#
#
# resolutions <- sapply(raster_list, res, simplify = TRUE)
# coarsest_resolution <- resolutions[, which.max(apply(resolutions, 2, function(x) max(x, na.rm = TRUE)))]
#
# #This is the common most coarse resolution
# coarsest_resolution
# r_coarsest_resolution <- raster_list[[which.max(apply(resolutions, 2, function(x) max(x, na.rm = TRUE)))]]
#
# l <- lapply(raster_list, FUN = function(x) raster(resample(x, r_coarsest_resolution)))
# st <- stack(l)
#
# plot(st)
Spatial predictions are restricted to the months were occurrence data of herring larvae is available (September, October, December and January). This is the same period where autumn-spawning herring spawn.
##TODO: delete this part
substr_lvl <- tibble(sub_char = c("Fine mud", "Sand", "Muddy sand", "Mixed sediment",
"Coarse substrate","Sandy mud or Muddy sand", "Seabed",
"Rock or other hard substrata","Sandy mud", "Sandy mud or Muddy sand ",
"Sediment","Fine mud or Sandy mud or Muddy sand"),
seabed_substrate = c(1:12))
energy_lvl <- tibble(ene_char = c("High energy", "Moderate energy", "Low energy", "No energy information"),
seabed_energy = c(1:4))
predictions <- stack()
for (m in unique(df_occ_bg_env$month)) {
plot_st <- stack(files %>%
filter(month == m | is.na(month)) %>%
dplyr::select(name) %>%
pull())
names(plot_st) <- str_extract(names(plot_st), pattern = paste0(variables, collapse = "|"))
names(plot_st)[which(names(plot_st) == "seabed_energy")] <- "ene_char"
names(plot_st)[which(names(plot_st) == "seabed_substrate")] <- "sub_char"
pred_m <- predict(plot_st, mod, clamp=T, type="cloglog",
factors = list(ene_char = factor(energy_lvl$seabed_energy,
labels = energy_lvl$ene_char,
levels = energy_lvl$seabed_energy),
sub_char = factor(substr_lvl$seabed_substrate,
labels = substr_lvl$sub_char,
levels = substr_lvl$seabed_substrate)))
# crop to extent of study area
pred_m <- crop(pred_m, extent(-12, 10, 48, 62))
names(pred_m) <- paste0("prediction_", month.name[m])
predictions <- stack(predictions, pred_m)
}
gplot(predictions) +
geom_tile(aes(fill = value)) +
facet_wrap(~ variable) +
labs(x = "", y = "") +
scale_fill_gradientn(colours = hcl.colors(225, "Viridis")) +
coord_equal() +
theme_minimal()
